This practice is part of the subject Biomedical Data Science of the Degree in Data Science at the Universitat Politècnica de València, and is taught by the Department of Applied Physics.
The measurement of data quality (DQ) dimensions is the central axis for the evaluation and improvement of data quality, as well as for its correct and optimal use. Data quality dimensions are individual aspects or constructs that represent data quality attributes. Each dimension can be associated with one or more metrics, quantified with specific methods, as well as with exploratory methods.
This practice is intended to provide an initial basis for the evaluation of DQ metrics. It will consist of the application of a series of methods for different dimensions of DQ. In the context of the maternal and child clinical setting, we will analyze a data file whose ultimate purpose is the monitoring of care indicators in this setting. Depending on the dimension, we will apply the methods and calculate the metrics both in general for the whole dataset and monitored over batches of time (by months), simulating the results of a DQ monitoring and continuous improvement system.
In some parts of the code we will find the text ##TODO## that we will need to complete. Additionally, we will have to discuss the results at the points where this is indicated. The deliverable of the practice will consist of this R Markdown file knitted to HTML (using Knit), showing the results of the execution and the figures, with the ##TODO## sections completed and the results commented.
We check that the working directory is in the one where we have the practice file and the folder with the data:
getwd()
Otherwise, we set it (change the example directory to ours):
setwd("D:/Users/Vartotojas/Documents/GitHub/BiomedicalDataScience_LAB05")
We install the required libraries and then load them into the working environment.
install.packages("zoo", repos = "http://cran.us.r-project.org")
install.packages("rts", repos = "http://cran.us.r-project.org")
install.packages("plotly", repos = "http://cran.us.r-project.org")
install.packages("devtools", repos = "http://cran.us.r-project.org")
library("devtools")
devtools::install_github('c5sire/datacheck')
devtools::install_github("hms-dbmi/EHRtemporalVariability")
library("zoo")
library("rts")
library("plotly")
library("datacheck")
library("EHRtemporalVariability")
We set the initial parameters of the data. The main date of the records, which will be used for the purpose of monitoring the delivery care indicators, is the date of birth.
# File name
fileName = "data/DQIinfantFeeding.csv"
# Whether it has a header or not
hasHeader = TRUE
# Main date column to be used for monitoring purposes
dateColumn = 'BIRTH_DATE'
# Format of the previous date
dateFormat = '%d/%m/%Y'
# Which text string will represent missing data
missingValue = ""
We load the file data/DQIinfantFeeding.csv in a data.frame named repository:
repository <- read.csv2(fileName, header=hasHeader, na.strings=missingValue)
# We collect the number of rows and columns
N <- nrow(repository)
D <- ncol(repository)
For monitoring purposes, we will use the zoo library (S3 Infrastructure for Regular and Irregular Time Series - Z’s Ordered Observations) to convert the data, the data.frame, to a format suited for batch analyses, the zoo format.
zooRepository <- read.zoo(repository,format = dateFormat,index.column = dateColumn)
One of the main uses of the maternal and infant data repository studied is the monitoring of quality of care indicators. In the field of newborn feeding, one of the most important indicators is whether there has been early initiation of breastfeeding in the delivery room. To calculate this indicator, we create the following function that will obtain the indicator for each batch of data received, so that we can apply it repeatedly for each batch given a frequency.
indicatorEBF_delroom <- function(dataset){
  # Numerator: early initiation during the 1st hour, in the delivery room,
  # for vaginal or instrumental deliveries
  numerator = (dataset$NEO_MOMFBF_TYPE %in% 'During the 1st hour') &
    (dataset$NEO_PFBF_TYPE %in% 'Delivery room') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
  # Denominator: eligible cases (any initiation time/place, breastfeeding not
  # undesired, same delivery modes)
  denominator = (dataset$NEO_MOMFBF_TYPE %in% c('During the 1st hour', 'During the 2nd hour', 'During the 3rd hour', 'Breastfeeding does not start')) &
    (dataset$NEO_PFBF_TYPE %in% c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start')) &
    !(dataset$NEO_FBFEVAL_TY %in% 'Undesired breastfeeding') &
    (dataset$DEL_MODE %in% c('Vaginal delivery', 'Thierry\'s spatulas', 'Forceps delivery', 'Vacuum delivery'))
  # Indicator as a percentage of eligible cases
  indicator = sum(numerator) / sum(denominator) * 100
  return(indicator)
}
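As a quick sanity check, the function can be exercised on a tiny hypothetical data.frame (made-up rows that reuse the column names and category labels from the rules above; not real data):

```r
# Hypothetical mini-dataset: row 1 is an early start in the delivery room,
# rows 2 and 3 are eligible cases without early initiation
toy <- data.frame(
  NEO_MOMFBF_TYPE = c('During the 1st hour', 'During the 2nd hour', 'Breastfeeding does not start'),
  NEO_PFBF_TYPE   = c('Delivery room', 'Hospitalization unit', 'Breastfeeding does not start'),
  NEO_FBFEVAL_TY  = c(NA, NA, NA),
  DEL_MODE        = c('Vaginal delivery', 'Vaginal delivery', 'Forceps delivery')
)
indicatorEBF_delroom(toy)  # 1 of 3 eligible cases -> 33.33333
```

Note that `%in%` treats NA as a non-match, so missing NEO_FBFEVAL_TY values do not exclude a case from the denominator.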
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts library (eXtensible Time Series). In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resIndicatorES2SC_delroom = apply.monthly(zooRepository, FUN = indicatorEBF_delroom)
plot(resIndicatorES2SC_delroom, xlab = "Date", ylab = "%", main = "Early breastfeeding start in the delivery room", ylim = c(0, 100))
📝 DISCUSS RESULTS
Plotting the indicator by monthly batches yields a time series showing the percentage of cases with early initiation of breastfeeding in each period.
A large block of data between mid-2009 and early 2011 is missing. We can also see that most monthly values lie around \(80\%\), meaning that roughly \(\frac{4}{5}\) of cases show early breastfeeding initiation in the delivery room.
We will find the missing data in the repository and calculate the corresponding metrics. First, for each variable:
NAmatrix <- !is.na(repository)
sumNAmatrix <- apply(NAmatrix,2,sum)
completenessByColumn <- round(sumNAmatrix/N*100,2)
completenessByColumn
## PATIENT_ID DEL_TYPE_A DEL_MODE DEL_TYPE_ANE DEL_IN_INTRACA
## 100.00 41.62 96.50 96.45 3.66
## BIRTH_DATE NEO_ES2SC_TYPE NEO_ES2SCP_TYPE NEO_MOMFBF_TYPE NEO_PFBF_TYPE
## 100.00 98.08 97.79 98.13 97.87
## NEO_FBFEVAL_TY NEO_APGAR_NUM NEO_WEIGHT_NUM NEO_PHCORD_TXT NEO_BF_DIS
## 97.79 31.31 95.42 70.64 95.90
Next, we will calculate and display the overall percentage of missing data:
completenessByDataset = sum(!is.na(repository))/(N*D) * 100
100 - completenessByDataset
## [1] 18.58984
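The global completeness formula can be checked by hand on a toy data.frame (hypothetical values) where the expected result is easy to verify:

```r
# 2 x 3 toy data.frame with exactly one missing cell: 5 of 6 cells present
toy <- data.frame(a = c(1, NA), b = c(3, 4), c = c(5, 6))
sum(!is.na(toy)) / (nrow(toy) * ncol(toy)) * 100  # -> 83.33333
```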
📝 DISCUSS RESULTS
Looking at the completenessByColumn vector, the only two columns without missing data are PATIENT_ID and BIRTH_DATE. This is likely because they are the key columns identifying each case, so leaving these cells empty is prohibited. The columns with the most missing data are DEL_IN_INTRACA (only \(3.66\%\) of values present), NEO_APGAR_NUM (\(31.31\%\)) and DEL_TYPE_A (\(41.62\%\)).

The DEL_IN_INTRACA variable seems to represent some type of complication during birth (such as “Risk of fetal compromise”, “Failed labour induction”, “No labour progression”, etc.), which explains why there are so few recorded values: most cases have no complications.

NEO_APGAR_NUM is a numeric variable with only one observed value, \(10\). It is hard to say what this specific variable indicates about the patient, but since there are only two possible outcomes (either \(10\) or missing), we can assume it acts as a TRUE/FALSE classifier. Following that interpretation, the empty cells could be replaced with \(0\) values.

DEL_TYPE_A could indicate where the anaesthetic was supplied; the missing values could then point to no anaesthetic being used.

Regarding the overall completeness of the dataset, \(18.59\%\) of values are missing. Although this number may cause some concern, some of the missing values are not essential and could be replaced with other classifiers.
To monitor completeness by temporal batches we will create a function that performs this calculation for each batch it receives as a parameter, and returns the completeness for each column: the function dimCompletenessByColumn:
dimCompletenessByColumn <- function(repository){
  N <- dim(repository)[1]
  NAmatrix <- !is.na(repository)
  sumNAmatrix <- apply(NAmatrix, 2, sum)
  completenessByColumn <- round(sumNAmatrix / N * 100, 2)
  return(completenessByColumn)
}
Once the function is loaded in the environment, we can easily apply it to the batches of data at the desired time frequency using the apply family of functions from the xts library (eXtensible Time Series). In this monthly case, we will use apply.monthly, to which we will pass as parameters the repository converted to zoo and the previously created function:
resCompletenessByColumn = apply.monthly(zooRepository, FUN = dimCompletenessByColumn)
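Conceptually, apply.monthly just partitions the observations into monthly batches and applies the supplied function to each batch. A base-R sketch with made-up dates (no zoo/xts required) illustrates the idea:

```r
# Toy data: three records, two in January and one in February 2014
dates  <- as.Date(c("2014-01-10", "2014-01-25", "2014-02-03"))
values <- data.frame(x = c(1, NA, 3))

# Split the rows into monthly batches and compute completeness of x per batch
batches <- split(values, format(dates, "%Y-%m"))
sapply(batches, function(b) sum(!is.na(b$x)) / nrow(b) * 100)
# -> 2014-01: 50, 2014-02: 100
```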
Now, we can create a plot with the results using the plotly library (Create Interactive Web Graphics via ‘plotly.js’). First for each variable:
p <-
plot_ly(
x = index(resCompletenessByColumn),
y = resCompletenessByColumn[, 1],
name = names(resCompletenessByColumn)[1],
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Completeness by month',
xaxis = list(title = "Date"),
yaxis = list (title = "Completeness (%)")
)
for (i in 2:ncol(resCompletenessByColumn)) {
p = p %>% plotly::add_trace(y = resCompletenessByColumn[, i],
name = names(resCompletenessByColumn)[i],
mode = 'lines')
}
p
And secondly globally, being able to calculate the result from the variable resCompletenessByColumn and use the code to plot a single series from the previous code chunk:
dimCompletenessByDataset <- function(repository){
  N <- dim(repository)[1]
  D <- dim(repository)[2]
  completenessByDataset <- sum(!is.na(repository)) / (N * D) * 100
  return(completenessByDataset)
}
resCompletenessByDataset = apply.monthly(zooRepository, FUN = dimCompletenessByDataset)
p <-
plot_ly(
x = index(resCompletenessByDataset),
y = resCompletenessByDataset,
type = 'scatter',
mode = 'lines'
) %>%
plotly::layout(
title = 'Global completeness by month',
xaxis = list(title = "Date"),
yaxis = list (title = "Completeness (%)")
)
p
📝 DISCUSS RESULTS
When analyzing data completeness per variable, we can see inconsistencies throughout the years. Most variables never drop below \(85\%\) completeness in any month (these variables being PATIENT_ID, NEO_ES2SC_TYPE, NEO_ES2SCP_TYPE, NEO_MOMFBF_TYPE, NEO_PFBF_TYPE, NEO_FBFEVAL_TY and NEO_BF_DIS, whose lowest monthly completeness, \(85.53\%\), occurs in the first month).

NEO_WEIGHT_NUM also shows relative completeness for the majority of the period, but the percentage plummets at the end of 2014, reaching a low point of \(77.08\%\).

On further analysis, we find that many values are lost in November 2014. Both DEL_TYPE_ANE and DEL_MODE take a large dip in that month (dropping to nearly \(60\%\)). We can see similar dips in this month for the variables DEL_TYPE_A and NEO_APGAR_NUM.

Another interesting thing to notice is the lack of data for the variables NEO_PHCORD_TXT and DEL_IN_INTRACA in the earliest portion of the period. NEO_PHCORD_TXT has its first input in April 2010, and DEL_IN_INTRACA has its first input in March 2011. After these first inputs, NEO_PHCORD_TXT shows a large increase in data completeness, at times reaching \(100\%\) for certain months. The same cannot be said for DEL_IN_INTRACA, which hovers between \(0\) and \(15\%\) for the remainder of the data.

The last two variables, DEL_TYPE_A and NEO_APGAR_NUM, have completeness that varies from month to month, similar to white noise. By the end of 2014, their monthly completeness reaches \(0\%\).

All of these incomplete variables lead to the heavily fluctuating graph of global completeness by month, which starts off with low completeness (around \(73\%\), due to NEO_PHCORD_TXT and DEL_IN_INTRACA), later fluctuates around \(83\%\), and then suddenly plummets to a low point of \(68\%\) in November 2014 (due to DEL_TYPE_ANE and DEL_MODE).
We are going to analyze two multivariate consistency rules in the data. For this we will use the datacheck library (Tools for Checking Data Consistency), which allows us to write logical rules in R language, using the variable names of our data. These rules will be written in the attached file rules.R, the first one is provided as an example, and the second one should be coded based on the provided natural language expression rule.
# We read the rules file
rules = read_rules("rules.R")
# We evaluate the rules on our data
profile = datadict_profile(repository, rules)
# We show the account of errors for each rule
knitr::kable(profile$checks[,c(1,3,6)])
| Variable | Rule | Error.sum |
|---|---|---|
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% "elective caesarea") \| (sapply(DEL_TYPE_ANE, tolower) %in% c("epidural anesthesia", "epidural anesthesia / general anesthesia", "spinal anesthesia", "spinal anesthesia / general anesthesia", "general anesthesia")) | 220 |
| DEL_MODE | !(sapply(DEL_MODE, tolower) %in% "emergency caesarea") \| (sapply(DEL_TYPE_ANE, tolower) %in% c("epidural anesthesia / general anesthesia", "spinal anesthesia", "spinal anesthesia / general anesthesia", "general anesthesia")) | 245 |
# We list the cases that have been marked as inconsistent for each rule
knitr::kable(profile$checks[,c(1,7)])
| Variable | Error.list |
|---|---|
| DEL_MODE | 472,500,525,583,636,638,654,683,698,699,729,793,802,811,821,838,846,865,866,890,918,969,975,985,987,1008,1010,1013,1066,1076,1082,1129,1150,1190,1194,1241,1246,1270,1300,1367,1383,1384,1388,1402,1411,1428,1433,1445,1446,1453,1459,1501,1508,1531,1559,1575,1578,1612,1616,1635,1665,1667,1669,1674,1685,1692,1719,1753,1770,1774,1781,1784,1787,1790,1791,1800,1802,1836,1839,1840,1855,1857,1861,1874,1883,1902,1925,1928,1933,1939,1950,1963,1970,1972,1984,1987,2000,2001,2010,2049,2055,2058,2066,2102,2122,2134,2135,2145,2169,2251,2278,2292,2322,2334,2348,2368,2369,2386,2401,2409,2457,2474,2493,2518,2530,2550,2551,2563,2583,2591,2592,2623,2624,2632,2647,2676,2679,2680,2686,2706,2716,2732,2762,2767,2778,2822,2857,2865,2883,2901,2907,2920,2923,2934,2935,2943,2963,2965,2981,2985,3002,3006,3016,3018,3024,3029,3039,3073,3099,3106,3107,3119,3126,3158,3179,3183,3188,3201,3216,3221,3222,3240,3254,3256,3259,3323,3344,3356,3359,3363,3365,3380,3391,3410,3418,3427,3443,3484,3519,3531,3550,3553,3560,3575,3592,3603,3618,3622,3627,3630,3638,3648,3660,3670,3676,3677,3698,3704,3707,3761 |
| DEL_MODE | 58,60,66,157,165,167,182,224,251,256,270,301,322,323,347,350,375,382,392,395,428,463,473,484,492,495,503,508,512,521,524,526,534,538,539,543,552,556,572,576,585,592,597,601,604,607,616,617,618,633,635,640,641,643,652,658,660,663,669,677,680,684,688,689,696,702,707,710,711,712,714,717,733,740,742,748,759,761,763,767,768,770,776,779,782,791,795,797,804,814,828,836,839,842,844,847,857,874,878,882,888,892,901,903,910,911,914,917,919,925,926,930,931,935,941,946,961,971,977,998,1007,1017,1022,1023,1024,1025,1032,1033,1035,1048,1050,1051,1052,1053,1056,1060,1062,1067,1072,1078,1084,1086,1088,1095,1103,1110,1111,1124,1126,1131,1132,1133,1135,1137,1139,1140,1143,1144,1146,1148,1163,1171,1179,1184,1187,1188,1191,1197,1199,1210,1214,1216,1220,1224,1234,1237,1244,1250,1255,1257,1258,1268,1273,1278,1282,1283,1286,1291,1294,1295,1297,1299,1308,1321,1329,1337,1353,1355,1357,1360,1361,1369,1376,1378,1380,1382,1385,1395,1400,1412,1413,1414,1417,1421,1426,1434,1438,1440,1442,1450,1456,1461,1464,1465,1472,1473,1476,1477,1479,1481,1488,1496,1504,1512,1519,1522,1537,1565,1568,1581,1583,1600,1604,1610,3149 |
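To inspect the flagged records themselves, the comma-separated Error.list string can be parsed back into row indices. A self-contained sketch with a toy excerpt (in practice the string would come from profile$checks$Error.list):

```r
# Hypothetical excerpt of an Error.list value
errString <- "58,60,66"
errRows   <- as.integer(strsplit(errString, ",")[[1]])
errRows  # -> 58 60 66
# repository[errRows, c("DEL_MODE", "DEL_TYPE_ANE")]  # then inspect the flagged cases
```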
📝 DISCUSS RESULTS
…
We are going to analyze if there are pattern changes in the data distributions over time. To do this we will use the EHRtemporalVariability library (Delineating Temporal Dataset Shifts in Electronic Health Records). First, we change to basic type Date the case date, originally in text format:
repositoryFormatted <- EHRtemporalVariability::formatDate(
input = repository,
dateColumn = "BIRTH_DATE",
dateFormat = dateFormat
)
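The conversion performed by formatDate is equivalent to parsing with base R's as.Date using the same format string; a quick self-contained check with a toy date:

```r
# Parsing a toy date string with the dateFormat defined at the top ('%d/%m/%Y')
d <- as.Date("25/12/2014", format = "%d/%m/%Y")
class(d)  # "Date"
d         # "2014-12-25"
```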
We obtain the temporal maps from the already formatted repository using the function estimateDataTemporalMap and selecting a monthly period. We can get the help of the function by typing ?estimateDataTemporalMap in the console (it is only necessary to pass the data, the column with the date, and the desired period, the rest is obtained automatically from the data).
probMaps <- estimateDataTemporalMap(
data = repositoryFormatted,
dateColumnName = "BIRTH_DATE",
period = "month"
)
Next we obtain the information geometry plots from the previously estimated temporal maps. To do this we will use the function estimateIGTProjection on the list of temporal maps. We are going to save the results in a variable.
igtProjs <- sapply( probMaps, estimateIGTProjection )
names( igtProjs ) <- names( probMaps )
We can observe as an example the data temporal map and information geometry temporal (IGT) plot of the variable type of anesthesia during delivery DEL_TYPE_ANE:
plotDataTemporalMap(probMaps$DEL_TYPE_ANE)
plotIGTProjection(igtProjs$DEL_TYPE_ANE, trajectory = TRUE)
In this example, we can see that over time there are changes or differences associated with synonymy of terms (No anesthesia, no, Without anesthesia), or even differences between upper and lower case (spinal, Spinal).
Next, we are going to save the results of the temporal variability estimation in a file, so that we can upload them to the web application EHRtemporalVariability, in the Load your RData tab.
save(probMaps, igtProjs, file = "variabilityData.RData")
Either using the web application, or obtaining the graphs in RStudio, we will study the temporal evolution of the type of delivery variable DEL_MODE, and discuss the results and their potential implications in the problem defined at the beginning of the practice on the monitoring of the early onset of lactation indicator, as well as propose a possible solution.
📝 DISCUSS RESULTS
…
The maternal and infant data studied have been kindly provided by Ricardo García de León for educational purposes only. Their use or distribution outside this subject and practice is forbidden.